Skip to content

Major overhaul of mbstring (part 31) #10591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Conversation

alexdowad
Copy link
Contributor

One more step towards eliminating the legacy encoding conversion code.

mb_decode_mimeheader is now around 3 times faster.

FYA @cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai

The new implementation is 2.5x-3x faster.

If an invalid charset name was used, the old implementation would get
'stuck' trying to parse the charset name and would not interpret any
other MIME encoded words up to the end of the input string. The new
implementation fixes this bug.

If an (invalid) encoded word ends abruptly and a new (valid) encoded
word starts, the old implementation would not decode the valid encoded
word. The new implementation also fixes this.

Otherwise, the behavior of the new implementation has been designed to
closely match that of the old implementation.
@@ -5641,3 +5613,232 @@ static void php_mb_gpc_set_input_encoding(const zend_encoding *encoding) /* {{{
MBSTRG(http_input_identify) = (const mbfl_encoding*)encoding;
}
/* }}} */

static int8_t decode_base64(unsigned char c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can standard base64 decode function be used here? especially with #6361

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I'll have to check what its interface looks like and see if it fits what is needed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the functions provided by base64.c take a pointer/length pair (good) and return a zend_string (not what I wanted).

But maybe I can adjust the code to use a zend_string. Let me think about that a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps it's possible to make the other code internally work with a pointer/length pair. And let the existing callers of that code work with a wrapper for zend_string that extracts the pointer & length.

@kamil-tekiela
Copy link
Member

It would be good to know what this function does. 😄 The PHP manual isn't very verbose https://www.php.net/manual/en/function.mb-decode-mimeheader.php
This comment actually seems to help a little bit https://www.php.net/manual/en/function.mb-decode-mimeheader.php#107692
I wonder if there is a functional specification for this. Did you adhere to some documentation?

@alexdowad
Copy link
Contributor Author

It would be good to know what this function does. smile The PHP manual isn't very verbose https://www.php.net/manual/en/function.mb-decode-mimeheader.php This comment actually seems to help a little bit https://www.php.net/manual/en/function.mb-decode-mimeheader.php#107692 I wonder if there is a functional specification for this. Did you adhere to some documentation?

@kamil-tekiela Thanks for the comment. This might be helpful: https://en.wikipedia.org/wiki/MIME; scroll down to the subheading "Encoded-Word".

The MIME standards (defined in various IETF RFCs) allow e-mail header values to include non-ASCII characters, by encoding them in a special format. mb_decode_mimeheader decodes this special format to an 'ordinary' string.

The input string for mb_decode_mimeheader should be the value of one field-value pair from an e-mail message's headers. It does not accept all the headers at once; the user has to break the headers up into field names and values and pass the values one by one through mb_decode_mimeheader to decode them.

I briefly mentioned the relevant standard in the code comments:

/* Decode MIME encoded word as defined in RFC 2047 */

@alecpl
Copy link

alecpl commented Feb 15, 2023

FYI, there's also iconv_mime_encode/iconv_mime_decode. Also, I'd like to note Mail_Mime does not use any of these https://github.com/pear/Mail_Mime/blob/master/Mail/mimePart.php#L853

@alexdowad
Copy link
Contributor Author

FYI, there's also iconv_mime_encode/iconv_mime_decode. Also, I'd like to note Mail_Mime does not use any of these https://github.com/pear/Mail_Mime/blob/master/Mail/mimePart.php#L853

Thanks for pointing that out!

I really wonder why we have two different built-in functions in PHP which do the same thing.

@alecpl
Copy link

alecpl commented Feb 15, 2023

I really wonder why we have two different built-in functions in PHP which do the same thing.

Yeah. I don't know. It might be that mbstring wasn't always been a "core extension". Also iconv supports a different (wider?) set of charsets.

@youkidearitai
Copy link
Contributor

youkidearitai commented Feb 15, 2023

I found blog that about talk PHP 8.0 and PHP 8.1 or later of mb_encode_mimeheader and mb_decode_mimeheader result is different. That blog is https://blog.cretbird.co.jp/mb_encode_mimeheader-php81.html.

Imitate this behavior, I confirmed different to this PR and PHP 8.1 or later and PHP 8.0. Code is below.

<?php
var_dump(
    mb_decode_mimeheader(
        mb_encode_mimeheader(
            mb_convert_encoding(
                "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやいゆえよらりるれろわゐうゑをん",
                "ISO-2022-JP",
                "UTF-8"
            ),
            "ISO-2022-JP"
        )
    )
);

This PR result is string(152) "あいうえおかきくけこさしすせそたちつて?箸覆砲未佑里呂劼佞悗曚泙澆爐瓩發笋い罎┐茲蕕蠅襪譴蹐錣陲Δ颪鬚??", but PHP 8.1 or PHP 8.0 is 3v4l here. Why is different behavior?

@youkidearitai
Copy link
Contributor

Result of mb_encode_mimeheader is not seems different. https://3v4l.org/BZEYU

@zbenc
Copy link

zbenc commented Feb 15, 2023

@MaxKellermann There is also the following known issue with this function: https://bugs.php.net/bug.php?id=68821

This issue makes it behave differently to imap_mime_header_decode(), which IMO should be equivalent.

So in order to properly handle MIME mail headers one must use this:

function correctDecodeMIME($string) {
	$string = \preg_replace_callback('/(=\?[^?]++\?[Qq]\?)([^?]++)(\?=|$)/D', static function ($match) {
		return $match[1] . \str_replace('_', '=20', $match[2]) . $match[3];
	}, $string);

	return \mb_decode_mimeheader($string);
}

@alexdowad
Copy link
Contributor Author

@MaxKellermann There is also the following known issue with this function: https://bugs.php.net/bug.php?id=68821

This issue makes it behave differently to imap_mime_header_decode(), which IMO should be equivalent.

Thanks for pointing that out. I reviewed the relevant RFC and you are 100% right; mb_decode_mimeheader does not handle underscores correctly.

I have added a fix to this PR.

@alexdowad
Copy link
Contributor Author

This PR result is string(152) "あいうえおかきくけこさしすせそたちつて?箸覆砲未佑里呂劼佞悗曚泙澆爐瓩發笋い罎┐茲蕕蠅襪譴蹐錣陲Δ颪鬚??", but PHP 8.1 or PHP 8.0 is 3v4l here. Why is different behavior?

@youkidearitai As usual, thanks for the great review! You have uncovered an interesting issue here.

I haven't completed my investigation yet, but one first point: the test code which you kindly provided is buggy. The documentation for mb_encode_mimeheader says:

string
The string being encoded. Its encoding should be same as [mb_internal_encoding()](https://www.php.net/manual/en/function.mb-internal-encoding.php).

In the code sample, mb_internal_encoding is not ISO-2022-JP, but you are passing an ISO-2022-JP string to mb_encode_mimeheader.

Here is a sample where mb_internal_encoding is set correctly: https://3v4l.org/pCV0d

@alexdowad
Copy link
Contributor Author

alexdowad commented Feb 17, 2023

@youkidearitai OK, more findings...

In PHP 8.1, I adjusted the mbstring conversion code for some legacy text encodings to flag errors rather than ignoring them. For example, in ISO-2022-JP, various encoding modes can be selected using escape sequences like \x1B$B. Now, consider this ISO-2022-JP string:

"abc\x1B"

How should we treat this string? The useless \x1B byte should not be there; \x1B is expected to start an escape sequence in ISO-2022-JP, but this one does not.

When PHP 8.0 mbstring processes this string, it ignores the trailing ESC byte and treats the string as valid. (mb_get_info('illegal_chars') is not incremented.) PHP 8.1 mbstring treats the trailing ESC byte as an error, outputs an error marker, and increments mb_get_info('illegal_chars').

This is the reason why PHP 8.1 mb_decode_mimeheader produces an extra ? character in the test case which you found; it's because there is an invalid escape sequence in the string returned by mb_encode_mimeheader.

Now, why do we have an invalid escape sequence in the string returned by mb_encode_mimeheader?? Again, it's because you set mb_internal_encoding incorrectly. Because mb_internal_encoding is set to UTF-8, mb_encode_mimeheader treats your ISO-2022-JP string as UTF-8 (and actually, it is valid UTF-8).

In ISO-2022-JP, "\x1B$B" is a single escape sequence, which cannot be divided without changing the meaning. But in UTF-8, those are just 3 ordinary single-byte characters. Because mb_encode_mimeheader is treating the input string as UTF-8, it has no problem in splitting up those 3 bytes, putting the first one in the first MIME encoded word and the remaining ones in the second MIME encoded word.

So basically:

  1. Because mb_internal_encoding was set wrongly, mb_encode_mimeheader produced wrong output. This wrong output was the same for both PHP 8.0 and PHP 8.1.
  2. When mb_decode_mimeheader sees the invalid input, it ignores the problem in PHP 8.0, but in PHP 8.1 it alerts you to the problem by producing an error marker.

@alexdowad
Copy link
Contributor Author

I found blog that about talk PHP 8.0 and PHP 8.1 or later of mb_encode_mimeheader and mb_decode_mimeheader result is different. That blog is https://blog.cretbird.co.jp/mb_encode_mimeheader-php81.html.

Solution #1 mentioned by Satou-san in the above-referenced blog post is the correct one:

1.mb_internal_encodingをISO-2022-JPに指定し、変換処理後に元に戻す

He says in the post that when he tried this, he found some strings which were corrupted when passed through mb_encode_mimeheader and mb_decode_mimeheader. But unfortunately, he doesn't provide any test cases, so we can't diagnose what actually went wrong.

@alexdowad
Copy link
Contributor Author

Solution 1 mentioned by Satou-san in the above-referenced blog post is the correct one:

There's actually another solution which it seems Satou-san didn't think about: provide a UTF-8 header, and let mb_encode_mimeheader convert it to ISO-2022-JP. Like this:

mb_internal_encoding('UTF-8');
$str = "This is a UTF-8 string. あいうえお。您好。";
mb_encode_mimeheader($str, 'ISO-2022-JP');

@youkidearitai
Copy link
Contributor

But unfortunately, he doesn't provide any test cases, so we can't diagnose what actually went wrong.

Indeed, Satou-san only presentation some one, we can't know what to do. We tried many workaround because multibyte mail is often mojibake.

@alexdowad
Copy link
Contributor Author

Indeed, Satou-san only presentation some one, we can't know what to do. We tried many workaround because multibyte mail is often mojibake.

Sorry to hear that. I am ready to fix anything which is wrong, but first, we need test cases which demonstrate the problem.

You have provided one very nice test case, which demonstrates very well what can happen when mb_internal_encoding is wrongly set. Now we need more test cases which show how mojibake can occur even when mb_internal_encoding is set correctly.

If you, or anyone else, can come forward with such a test case, I am ready to analyze the root cause. Understanding the root cause is very necessary before anything can be done about it.

@alexdowad
Copy link
Contributor Author

Hmm. I was hoping that maybe @youkidearitai might come up with some other test cases related to the reports of mojibake, but it seems that none are forthcoming now...

I am inclined to merge this PR "as is" and leave the issue of using the SIMD-accelerate Base64 decoder as a possible future enhancement.

If anyone strongly feels we need SIMD-accelerated mb_decode_mimeheader now and not later, please speak up....

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know about mime header encoding, but looks reasonable.

@alexdowad alexdowad closed this Feb 22, 2023
@alexdowad alexdowad deleted the mime branch February 22, 2023 21:21
@alexdowad
Copy link
Contributor Author

Landed... thanks all.

@youkidearitai
Copy link
Contributor

Hmm. I was hoping that maybe @youkidearitai might come up with some other test cases related to the reports of mojibake, but it seems that none are forthcoming now...

I'm sorry for waiting me. But mail mojibake know-how is almost closed part of company. The problems reported in php-src only a fraction...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants